Introduction and Motivation

The five leading causes of death in the United States from 1999 to 2014 are cancer, heart disease, unintentional injury, chronic lower respiratory disease, and stroke. The dataset includes the U.S. Department of Health and Human Services public health regions. Therefore, we can investigate the leading causes of death of each region, and develop accordingly public health policy and remedies.

Data Descriptions

Approaches

Visualizations

Conclusion

Notes

cod_data = read_csv("./data/NCHS_-_Potentially_Excess_Deaths_from_the_Five_Leading_Causes_of_Death.csv") %>%
  clean_names() %>%
  na.omit() %>%
  filter(!(state == "United States")) %>%
  separate(., percent_potentially_excess_deaths, into = c("percent_excess_death"), sep = "%") %>% 
  mutate(percent_excess_death = as.numeric(percent_excess_death), mortality = observed_deaths/population * 10000, mortality = as.numeric(mortality)) %>% 
  select(year, age_range, cause_of_death, state, locality, observed_deaths, population, expected_deaths, potentially_excess_deaths, percent_excess_death, mortality, hhs_region)
## Parsed with column specification:
## cols(
##   Year = col_integer(),
##   `Cause of Death` = col_character(),
##   State = col_character(),
##   `State FIPS Code` = col_character(),
##   `HHS Region` = col_integer(),
##   `Age Range` = col_character(),
##   Benchmark = col_character(),
##   Locality = col_character(),
##   `Observed Deaths` = col_integer(),
##   Population = col_integer(),
##   `Expected Deaths` = col_integer(),
##   `Potentially Excess Deaths` = col_integer(),
##   `Percent Potentially Excess Deaths` = col_character()
## )
## Warning: Too many values at 191748 locations: 1, 2, 3, 4, 5, 6, 7, 8, 9,
## 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...
##columns removed

 #"state_fips_code"                      "benchmark"   "potentially_excess_deaths" "percent_excess_death"      "mortality"   

xs2329:

cod_data %>%
  group_by(cause_of_death) %>% 
  ggplot(aes(x = locality, y = mortality, fill = cause_of_death)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title="Locality vs. Mortality") +
  labs(x="locality", y="mortality") 

This bar graph shows the distribution of cause of death within mortality in the three geographic regions: Metropolitan, Nonmetropolitan and All. We observe that the weight of each composition of cuases of death is the same regardless of the locality. Nonmetropolitan area seems to have the highest mortaility, the number of deathes observed in every 10000 people.

cod_data %>%
  group_by(cause_of_death) %>% 
  ggplot(aes(x = locality, y = mortality)) + geom_boxplot(aes(color = cause_of_death), na.rm = T) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title="Locality vs. Mortality") +
  labs(x="locality", y="mortality") 

We observe that the median of mortaility is the highest among all 5 causes of death, followed by heart disease, unintentional injury, heart disease, chronic lowee respiratory disease and stroke. This pattern remains consistant when we subdivide locaility into metropolitan region and nonmetrpolitan region, which is identical to our previous finding. All 5 cuases of death are right skewed regardless of the locality.

mortality_lm = lm(mortality ~locality, data = cod_data)
summary(mortality_lm)
## 
## Call:
## lm(formula = mortality ~ locality, data = cod_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1454 -2.9547 -1.2983  0.9988 19.8959 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              4.10331    0.01672 245.474   <2e-16 ***
## localityMetropolitan    -0.22576    0.02372  -9.519   <2e-16 ***
## localityNonmetropolitan  1.14451    0.02438  46.942   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.317 on 191745 degrees of freedom
## Multiple R-squared:  0.01821,    Adjusted R-squared:  0.0182 
## F-statistic:  1778 on 2 and 191745 DF,  p-value: < 2.2e-16

interpretation on linear regression:

Here we obtain the linear regression model of mortality = 4.103 - 0.226localityMetropolitan -1.145localityNonmetropolitan. We expect to see 0.22576 less deaths for every 10000 people in the metropolitan region as compared to all regions. We expect to see 1.14451 more death for every 10000 people in the nonmetropolitan region as compared to all regions.

cod_data %>%
  mutate(year = as.factor(year))
## # A tibble: 191,748 x 12
##      year age_range cause_of_death   state        locality observed_deaths
##    <fctr>     <chr>          <chr>   <chr>           <chr>           <int>
##  1   2005      0-49         Cancer Alabama             All             756
##  2   2005      0-49         Cancer Alabama    Metropolitan             556
##  3   2005      0-49         Cancer Alabama Nonmetropolitan             200
##  4   2005      0-49         Cancer Alabama             All             756
##  5   2005      0-49         Cancer Alabama    Metropolitan             556
##  6   2005      0-49         Cancer Alabama Nonmetropolitan             200
##  7   2005      0-49         Cancer Alabama             All             756
##  8   2005      0-49         Cancer Alabama    Metropolitan             556
##  9   2005      0-49         Cancer Alabama Nonmetropolitan             200
## 10   2005      0-54         Cancer Alabama             All            1346
## # ... with 191,738 more rows, and 6 more variables: population <int>,
## #   expected_deaths <int>, potentially_excess_deaths <int>,
## #   percent_excess_death <dbl>, mortality <dbl>, hhs_region <int>
p <- cod_data %>%
  plot_ly(
    x = ~expected_deaths, 
    y = ~observed_deaths, 
    size = ~population, 
    color = ~cause_of_death, 
    frame = ~hhs_region, 
    text = ~state, 
    hoverinfo = "text",
    type = 'scatter',
    mode = 'markers'
  ) 
p
<<<<<<< HEAD

<<<<<<< HEAD

=======
>>>>>>> 066c9212fe9218f503b990545ce88ab848f64098
### Question of concern: If percent excess death has significant difference between metro and non-metro groups.
gp_cod_data = cod_data %>%
  group_by(year, locality) %>% 
  summarise(mean_ped = mean(percent_excess_death))

ggplot(gp_cod_data, aes(x = year, y = mean_ped, fill = locality)) + geom_point(stat = "identity") + geom_smooth() +
facet_grid(. ~ locality) 
## `geom_smooth()` using method = 'loess'

gp_locality_lm = lm(mean_ped ~locality , data = gp_cod_data)
summary(gp_locality_lm)
## 
## Call:
## lm(formula = mean_ped ~ locality, data = gp_cod_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5667 -1.0325 -0.3470  0.7922  3.2937 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              33.8866     0.4071  83.238  < 2e-16 ***
## localityMetropolitan     -2.0599     0.5757  -3.578   0.0012 ** 
## localityNonmetropolitan   8.0460     0.5757  13.975 1.13e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.35 on 30 degrees of freedom
## Multiple R-squared:  0.9198, Adjusted R-squared:  0.9145 
## F-statistic: 172.1 on 2 and 30 DF,  p-value: < 2.2e-16
broom::tidy(gp_locality_lm)
##                      term  estimate std.error statistic      p.value
## 1             (Intercept) 33.886633 0.4071070 83.237648 4.783207e-37
## 2    localityMetropolitan -2.059931 0.5757363 -3.577906 1.200041e-03
## 3 localityNonmetropolitan  8.045979 0.5757363 13.975111 1.132457e-14

The regression model is mean percent_excess_death = 33.89 - 2.05*Metropolitan + 8.046Nonmetropolitan. For people in metropolitan, the expected percent excess death has a decrease of 2.06. For people in non-metropolitan aream the expected percent excess death has an increase of 8.046. The p-values show these predictors are both significant. The adjusted R squared is 0.91, which indicates the regression model explains 91% of variation in mean percent excess death due to the variation in locality.

<<<<<<< HEAD

=======